ABSTRACT

We analyze the problem of using Explore-Exploit techniques
to improve precision in multi-result ranking systems such as
web search, query autocompletion and news recommendation.
Adopting an exploration policy directly online, without
understanding its impact on the production system, may
have unwanted consequences: the system may sustain large
losses, create user dissatisfaction, or collect exploration data
which does not help improve ranking quality. An offline
framework is thus necessary to decide which policy to apply
in a production environment, and how to apply it, to ensure
a positive outcome. Here, we describe such an offline
framework.

Using the framework, we study a popular exploration policy,
Thompson sampling. We show that there are different
ways of implementing it in multi-result ranking systems,
each having a different semantic interpretation and leading
to different results in terms of sustained click-through-rate
(CTR) loss and expected model improvement. In particular,
we demonstrate that Thompson sampling can act as an
online learner optimizing CTR, which in some cases can lead
to an interesting outcome: a lift in CTR during exploration.
The observation is important for production systems as it
suggests that one can both collect valuable exploration data
to improve ranking performance in the long run and, at the
same time, increase CTR while exploration lasts.

Categories and Subject Descriptors 

H.4 [Information Systems Applications]: Miscellaneous 


Keywords 

explore-exploit, ranking, evaluation, Thompson sampling

1. INTRODUCTION 

We study “multi-result” ranking systems, i.e., systems
which rank a number of candidate results and present the
top N to the user. Examples of such systems are web search,
query autocompletion (see Figure 1), news recommendation,
etc. This is in contrast to “single-result” ranking systems,
which also internally utilize ranking mechanisms but in the
end display only one result to the user.

One challenge with ranking systems in general is their
counterfactual nature: we cannot directly answer questions
of the sort “Given a query, what would have happened
if we had shown a different set of results?” as this is counter
to the fact. The fact is that we showed whatever results the
current production model considered best. Learning new
models is thus biased and limited by the deployed ranking
model. One sound and popular approach to breaking
the dependence on an already deployed model is to integrate
an Explore-Exploit (EE) component into the production
system. Exploration occasionally randomizes the results
presented to the user by overriding some of the top choices
of the deployed model and replacing them with potentially
suboptimal results. As a result, the collected data contains
certain random results that were shown with small
probabilities. When training subsequent ranking models,
these results are often assigned higher weights,
inversely proportional to the probabilities with which they
were explored [2, 17]. As theoretically justified and empirically
demonstrated, exploration usually allows better models
to be learned. However, adopting exploration in a production
system prompts a set of essential questions: which EE
policy is most suitable for the system; what would be the
actual cost of running EE; and most importantly, how to
best use the exploration data to train improved models and
what improvements are to be expected?

Here, we present an offline framework which allows
“replaying” query logs to answer counterfactual questions. The
framework can be used to answer the above exploration
questions, allowing one to compare different EE policies
prior to their integration in the online system. This is
very important as running an inadequate policy online can
quickly lead to significant monetary loss, cause broad user
dissatisfaction, or collect exploration data which is altogether
useless in improving the ranking model.

As a concrete example, we use the offline framework to
evaluate Thompson sampling, a popular EE method which is
simple to implement and is very effective at trading off
exploration and exploitation [4, 16, 17]. We point out that in fact
there are multiple ways of implementing Thompson sampling,
each having a different semantic interpretation. Some
of the implementations correct for bias (calibration problems)
in the ranking model scores while others correct for
position bias in the results. Naturally, employing different
strategies leads to different costs, i.e. the price to be paid
for exploring suboptimal results, and to different model
improvements. We also introduce two schemes for weighting
the examples collected through exploration when training
new ranking models.
Figure 1: Maps query autocomplete system. For the same (query) prefix “santa” the system ranks on top geo-results which it
deems relevant to the user context. Left: San Francisco, Right: Santiago.


Because EE can promote suboptimal results it is commonly
presumed that production systems adopting it always
sustain a drop in the click-through-rate (CTR) during the
period of exploration. By analyzing Thompson sampling
policies through our offline evaluation framework, we observe
an interesting phenomenon: using the right implementation
can, in fact, produce a lift in CTR of the production
system. In other words, the gain is twofold: the system
collects valuable training data and has an improved CTR
while exploration continues.

To summarize, this paper makes the following contributions:

• We describe a novel framework for offline evaluation
and comparison of explore-exploit policies in multi-result
ranking systems.

• We introduce several new Thompson sampling implementations
that are tailored to multi-result ranking
systems, some more suitable in the case of ranking
score bias and others in the case of position bias.

• We introduce two simple weighting schemes for training
examples collected through Thompson sampling.

• Using the framework, we show that adopting the right
policy can achieve exploration and increase the CTR
of the production system during exploration.

The rest of the paper is organized as follows. In Section 2
we introduce the offline policy evaluation framework using
a maps query autocompletion system as an example.
Section 3 discusses different ways of implementing Thompson
sampling and weighting examples collected with them.
Section 4 compares the introduced Thompson sampling policies.
We finish by discussing related work.

2. EXPLORE-EXPLOIT FRAMEWORK 


2.1 A multi-result ranking system 

As a working example we look into the maps
query autocompletion service of a popular map search engine
(Figure 1). When users start typing a “query” in the
system they are presented with up to N = 5 relevant geo
entities as suggestions. If users click on one of the results,
we assume that we have met their intent; if they do not,
the natural question to ask is “Could we have shown a different
set of results which would get a click?”. As pointed
out in the introduction, the question is counterfactual and
cannot be answered easily as it requires showing a different
set of suggestions in exactly the same context. This section
describes a framework that allows answering such
counterfactual questions. We first go over some prerequisites.

In building ranking systems as above, the usual process
goes roughly through the following three stages: 1) The
query is matched against an index; 2) For all matched entities
a first layer of ranking, the L1 ranker, is applied. The goal
of the L1 ranker is to ensure very high recall and prune the
matched candidates to a more manageable set of, say, a few
hundred results; 3) A second layer, the L2 ranker, is then applied
which re-orders the L1 results in a way that ensures high
precision. There could be more layers with some specialized
functionality but overall these three stages cover the three
important aspects: matching, recall and precision.

Building the L1 ranker is beyond the scope of this work.
Here, we focus on methodologies for improving L2 ranking,
namely, the precision of the system. We assume that there is
a machine learned model powering the L2 ranker, i.e. the
system adopts a learning-to-rank approach.

Let us now focus on the structure of the logs generated
by the system as well as how an EE policy can be applied
in it. Table 1 shows what information can be logged by a
multi-result ranking system for a query.

Suppose that L1 ranking extracted M > N relevant results,
which L2 then re-ranked to produce the top N = 5
suggestions in Table 1. We assume that, for at least a
fraction of the queries, the suggestions from the table and
the user actions are logged by the production system.

Position (i)   Label (y)   Result (r)     Rank score (s)
i = 1          0           Suggestion 1   s1 = 0.95
i = 2          0           Suggestion 2   s2 = 0.90
i = 3          1           Suggestion 3   s3 = 0.60
i = 4          0           Suggestion 4   s4 = 0.45
i = 5          0           Suggestion 5   s5 = 0.40

Table 1: Original system: example logs for one query.

The first column in the table shows the ranking position
of the suggested result. The Label column reflects the observed
clicks: 1 if the result was clicked and 0 otherwise.
The Result column contains some context about the result
that was suggested, only part of which is displayed to the
user while the rest is used to extract features to train the
ranking model. The last column shows the score which the
L2 ranking model has assigned to the results.

2.2 Explore-Exploit in the online environment 

We assume a relatively conservative exploration process
taking place in the online environment. Namely, exploration
is allowed to replace only the suggestion appearing at position
i = N. This is to ensure that we do not generate large user
dissatisfaction by placing potentially bad results as top
suggestions. For exploration, in addition to the candidate at
position i = N, we choose among the candidates which L1 returns
and L2 ranks at positions i = N + 1, ..., i = N + t < M, for
some relatively small t. By doing so, we do not explore results
which are very low down the ranking list of L2 as they
are probably not very relevant. Requiring that a candidate
for exploration meets some minimum threshold for its ranking
score is also a good idea. Naturally, if for a query there
are fewer than N = 5 candidates, then no exploration takes
place for it.
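
The candidate selection just described can be sketched in a few lines of Python. This is a minimal illustration, not the production logic: the function name, the `(result, score)` tuple layout, the default `t = 3` and the `min_score` parameter are all assumptions made for the example, and the incumbent at position N is assumed to be always eligible.

```python
from typing import List, Tuple

Candidate = Tuple[str, float]  # (result id, L2 score); hypothetical layout


def exploration_candidates(
    ranked: List[Candidate], n: int = 5, t: int = 3, min_score: float = 0.0
) -> List[Candidate]:
    """Return the results eligible for the last displayed position N.

    `ranked` is assumed to be sorted by L2 score, best first, so that
    ranked[0] is position i = 1.
    """
    if len(ranked) < n:
        return []  # fewer than N candidates: no exploration for this query
    incumbent = ranked[n - 1]        # result the model places at position N
    challengers = ranked[n : n + t]  # positions N+1 .. N+t
    # Assumption: only challengers must pass the minimum score threshold.
    return [incumbent] + [c for c in challengers if c[1] >= min_score]
```

With seven ranked results, N = 5 and t = 2, the function returns the results at positions 5 to 7, minus any challenger whose score falls below `min_score`.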

To enable EE online we also need to define a policy, i.e. a
mechanism which selects with a certain probability a suggestion
different from the one that the deployed ranking model
would recommend. Different policies can be implemented.
In Section 3 we study several such policies, how they
can be simulated in an offline environment, and how the
data collected from them can be weighted suitably
for training better ranking models.

2.3 Offline simulation environment 

Running the above EE process directly in the production
environment can lead to costly consequences: it may start
displaying inadequate results which can cause the system to
sustain significant loss in CTR in a very short time. It is
further unclear whether it will help us collect training examples
that will lead to improving the quality of the ranking
model. We therefore want to simulate the above online process
a priori in an offline system that closely approximates
it. Here, we present such an offline system.

The main idea of the offline system is to mimic a scaled-down
version of the production system. Specifically, we assume
offline that our Autocomplete system displays k < N
results to the user instead of N = 5. To replicate the
online EE process from Section 2.2, the policies evaluated
in the offline system are allowed to show on its last
position (i.e., on position k) any of the results from positions
i = k, ..., N.
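
As a rough sketch of this scaled-down view, the logged rows of one query can be split into the part that is always displayed and the part that competes for position k. The dictionary layout and the name `split_offline` are assumptions for illustration; the values are those of Table 1.

```python
# Logged rows for one query, in production order (values from Table 1).
TABLE_1 = [
    {"position": 1, "label": 0, "result": "Suggestion 1", "score": 0.95},
    {"position": 2, "label": 0, "result": "Suggestion 2", "score": 0.90},
    {"position": 3, "label": 1, "result": "Suggestion 3", "score": 0.60},
    {"position": 4, "label": 0, "result": "Suggestion 4", "score": 0.45},
    {"position": 5, "label": 0, "result": "Suggestion 5", "score": 0.40},
]


def split_offline(log_rows, k):
    """Split one logged query for an offline system that displays k results.

    Returns (fixed, candidates): `fixed` holds positions 1..k-1, which are
    shown unchanged, and `candidates` holds positions k..N, any one of
    which a policy may place at position k.
    """
    if not 1 <= k <= len(log_rows):
        raise ValueError("k must be between 1 and the number of logged rows")
    return log_rows[: k - 1], log_rows[k - 1 :]
```

For k = 2 this yields one fixed row and four candidates (positions 2 to 5); for k = 3 it yields two fixed rows and three candidates.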

To understand the offline process better, let us look into
two concrete instantiations of the simulation environment
which use the logged results from Table 1.

In the first instantiation, we set k = 2. It means that the
offline system displays to the user two suggestions, as seen in
Table 2. Position i = 2 is going to be used for exploration,
and the result to be displayed will be selected among the
candidates at positions i = 2, ..., i = 5.



Group           Pos       Label   Result    Score (s)
Displayed       i = 1     0       Sugg. 1   s1 = 0.95
Displayed       i = 2 *   0       Sugg. 2   s2 = 0.90
EE candidates   i = 3     1       Sugg. 3   s3 = 0.60
EE candidates   i = 4     0       Sugg. 4   s4 = 0.45
EE candidates   i = 5     0       Sugg. 5   s5 = 0.40

Table 2: Offline system with k = 2. Logs for one query
derived from the logs from Table 1. Position i = 2 (marked *)
is used for exploration.

In the second instantiation, we set k = 3; that is, the
offline system is assumed to display three suggestions.
Position i = 3 is used for exploration and the candidates for it
are the results from the original logs at positions i = 3, 4, 5.
This setting is depicted in Table 3.



Group           Pos       Label   Result    Score (s)
Displayed       i = 1     0       Sugg. 1   s1 = 0.95
Displayed       i = 2     0       Sugg. 2   s2 = 0.90
Displayed       i = 3 *   1       Sugg. 3   s3 = 0.60
EE candidates   i = 4     0       Sugg. 4   s4 = 0.45
EE candidates   i = 5     0       Sugg. 5   s5 = 0.40

Table 3: Offline system with k = 3 suggestions. Logs for one
query derived from the logs from Table 1. Position i = 3
(marked *) is used for exploration.

Suppose we use k = 2. Using only the production system
we would display in our simulated environment “Suggestion
1” and “Suggestion 2” and we would not observe a click, as the
label in the logs for both position i = 1 and i = 2 is zero.
Now suppose we use the described framework to compare
two EE policies, π1 and π2, each selecting a different result
to display at position i = 2. For example, π1 can select to
preserve the result at position i = 2 (“Suggestion 2”) while
π2 can select to display instead the result at position i = 3
(“Suggestion 3”). Now we can ask the question, counterfactual
with respect to the simulated system, “What would have
happened had we applied either of the two policies?”. The
answer is that with π2 we would have observed a click, which we
know from the original system logs (Table 1), and with π1
we would not have. If this is the only exploration which we
perform, the information obtained with π2 would be more
valuable and would probably lead to training a better new
ranking model. Note also that applying π2 would actually
lead to a higher CTR than simply using the production system.
This gives an intuitive idea of why CTR can increase
during exploration, as demonstrated in the evaluation in
Section 4.
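
The replay argument for the two policies can be checked with a short Python sketch. The helper name `replay_click` and the representation of a logged query as a plain list of labels are assumptions for illustration; the labels are those of Table 1.

```python
# Click labels for positions i = 1..5 of the logged query in Table 1.
LABELS = [0, 0, 1, 0, 0]


def replay_click(labels, k, chosen_position):
    """Return 1 if the simulated k-result system would have recorded a click.

    Positions 1..k-1 are shown as logged; position k shows the logged result
    from `chosen_position` (1-indexed, between k and N). The logged label is
    reused unchanged, as the simulation environment assumes.
    """
    if not k <= chosen_position <= len(labels):
        raise ValueError("chosen_position must lie in k..N")
    shown = labels[: k - 1] + [labels[chosen_position - 1]]
    return int(any(shown))


pi_1 = replay_click(LABELS, k=2, chosen_position=2)  # keeps "Suggestion 2"
pi_2 = replay_click(LABELS, k=2, chosen_position=3)  # shows "Suggestion 3"
```

Here `pi_1` is 0 and `pi_2` is 1, matching the discussion above: only the second policy surfaces the clicked result.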

It should be noted that our simulation environment effectively
assumes the same label for an item when it is moved
to position k from another, lower position k' > k. Due to
position bias, the CTR of an item tends to be smaller if the item
is displayed in a lower position. Therefore, our simulation
environment has a one-sided bias, favoring the production
baseline that collects the data. While the bias makes the
offline simulation results less accurate, its one-sided nature
implies that the results which we show in Section 4.2 are
conservative: if a new policy is shown to have a higher offline CTR
in the simulation environment than the production baseline,
its online CTR can only be higher in expectation.

3. THOMPSON SAMPLING FOR MULTI-RESULT RANKING

We now demonstrate the offline framework by comparing
several implementations of Thompson sampling, a popular
policy due to its simplicity and effectiveness [4, 16, 17]. In
the process we make two interesting observations. First,
there are multiple ways to implement Thompson sampling
for multi-result ranking problems. They have different
interpretations and lead to different results. Second, if the
“right” implementation for the problem at hand is selected,
then Thompson sampling can refine the ranking model CTR
estimates to yield better ranking results. The method then
essentially works as an online learner, improving the CTR
of the underlying model by identifying segments where the
model is unreliable and overriding its choice with a better
one. This in turn can lead to an important result: adopting
EE can entail a twofold benefit. It can collect valuable
data for improving ranking precision, and at the same time
lift the CTR of the production system during the period of
exploration.

Algorithm 1 outlines our generic implementation of
Thompson sampling. The algorithm closely follows the
implementation suggested in the literature, with two subtle
modifications: 1) defining exploration buckets (also known as
“arms” in the multi-armed bandit literature) in line 1; and
2) sampling from the subset of buckets relevant only to the
current iteration in line 4. We elaborate on these points below,
and will show they are essential and lead to very different
results with different instantiations. In the evaluation section
we also discuss how we set the exploration constant ε (lines
9, 11) to update the parameters of the Beta distributions.

Algorithm 1 Thompson Sampling for Multi-result Ranking

 1: Define buckets: P = {P_1, P_2, ..., P_n}
 2: Initialize Beta distributions: {B(α_1, β_1), ..., B(α_n, β_n)}   ▷ One per bucket
 3: while (ExplorationIsEnabled) do
 4:     ℙ ← {P_c1, ..., P_cl} ⊆ P   ▷ Select the buckets involved in the current iteration
 5:     Draw θ_i ~ B(α_i, β_i), ∀ i ∈ {c_1, ..., c_l}
 6:     m ← argmax_i θ_i
 7:     Display r_m for exploration   ▷ r_m is the result associated with the selected bucket P_m
 8:     if (r_m clicked) then
 9:         α_m ← α_m + ε   ▷ Increment with constant
10:     else
11:         β_m ← β_m + ε   ▷ Increment with constant

3.1 Thompson sampling policies

Here we describe three policies based on Thompson sampling.
Each is characterized by: 1) how it defines the buckets
(line 1 in the algorithm); 2) what probability estimate the
bucket definition semantically represents.

Sampling over positions policy.

This is probably the most straightforward, but not very
effective, implementation of Thompson sampling. It defines
buckets over the ranking positions used for drawing exploration
candidates. More specifically we have:

Bucket definition: There are n = N − k + 1 buckets,
each corresponding to one of the candidate positions
i = k, ..., i = N. All of them can be selected in each
iteration, so ℙ = P. For instance, if we have the instantiation
from Table 3 we would have three buckets, n = 3, for positions
i = 3, i = 4, and i = 5, which are the positions of the
candidates for exploration.

Probability estimate: P(click | i, k). In this implementation
Thompson sampling estimates the probability of click
given that a result from position i is shown on position k.
This implementation allows for correction in the estimate of
CTR per position. The approach also allows for correcting
position bias. Indeed, results which are clicked simply because
of their position may impact the ranking model, and
during exploration we may throw them away, eliminating
their effect on the system. This makes the approach especially
valuable in systems with pronounced position bias.
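
The steps of Algorithm 1 can be sketched in Python roughly as follows. This is a minimal sketch rather than the authors' implementation: the class name, the 0-based bucket indices and the uniform Beta(1, 1) initialization are assumptions, since the initial parameters are not specified in the text. Only the standard library call `random.betavariate` is used.

```python
import random


class BucketThompsonSampler:
    """Thompson sampling over a fixed set of buckets (sketch of Algorithm 1)."""

    def __init__(self, n_buckets, epsilon=1.0, alpha0=1.0, beta0=1.0, seed=None):
        # One Beta(alpha, beta) per bucket (line 2). The Beta(1, 1) prior
        # is an assumption of this sketch.
        self.alpha = [alpha0] * n_buckets
        self.beta = [beta0] * n_buckets
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def select(self, active):
        """Draw from the active buckets only and return the arg max (lines 4-6)."""
        draws = {c: self.rng.betavariate(self.alpha[c], self.beta[c]) for c in active}
        return max(draws, key=draws.get)

    def update(self, m, clicked):
        """Increment the selected bucket's parameters by epsilon (lines 8-11)."""
        if clicked:
            self.alpha[m] += self.epsilon
        else:
            self.beta[m] += self.epsilon


# Sampling over positions with N = 5 and k = 3: buckets 0, 1, 2 stand for
# positions i = 3, 4, 5, and all of them are active in every iteration.
sampler = BucketThompsonSampler(n_buckets=3, seed=0)
chosen = sampler.select(active=[0, 1, 2])
sampler.update(chosen, clicked=True)
```

The three policies in this section differ only in how buckets are defined and in which of them are passed as `active` for a given query.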


Sampling over scores policy.

In this implementation we define the buckets over the
scores of the ranking model. Each bucket covers a particular
score range. For simplicity let us assume that the score interval
[0, 1] is divided into one hundred equal subintervals, one
per percentage point: [0, 0.01), [0.01, 0.02), ..., [0.99, 1].
For the suggested division we have:

Bucket definition: There are n = 100 buckets, one per
score interval: P_1 = [0, 0.01), ..., P_100 = [0.99, 1]. In each
iteration only a small subset of these are active. In the
example from Table 3 only the following three buckets are
active: P_c1 = P_61 = [0.60, 0.61), P_c2 = P_46 = [0.45, 0.46),
and P_c3 = P_41 = [0.40, 0.41). Suppose that after drawing from
their respective Beta distributions we observe that m = 61,
i.e. we should show the first of the three candidates, which
turns out to result in a click. In this case, we update the
positive outcome parameter for the corresponding Beta to

α_61 = α_61 + ε.

Probability estimate: P(click | s, k). In this implementation
Thompson sampling estimates the probability of click
given a ranking score s for a result when shown at position
k. In general, if we run a calibration procedure then the
scores are likely to be close to the true posterior click
probabilities for the results [15], but this is only true if we look
at them agnostic of position. With respect to position i = k
they may not be calibrated. We can think of Thompson
sampling as a procedure for calibrating the scores from the
explored buckets to closely match the CTR estimate with
respect to position i = k.

Figure 2: Model improvement for an Autocomplete system
displaying k = 2 results to the user. (Panel title: “Model
improvement. Maps Autocomplete with k=2 suggestions”;
x-axis: number of training Autocomplete (AC) queries.)

Figure 3: Model improvement for an Autocomplete system
displaying k = 3 results to the user. (Panel title: “Model
improvement. Maps Autocomplete with k=3 suggestions”;
x-axis: number of training Autocomplete (AC) queries.)

Sampling over scores and positions policy.

This is a combination of the above two implementations.
Again we assume that the score interval is divided into one
hundred equal parts [0, 0.01), [0.01, 0.02), ..., [0.99, 1]. This,
however, is done for each candidate position i = k, ..., i = N.
That is, we have:

Bucket definition: (N − k + 1) * 100 buckets. For more
compact notation let us assume that bucket P_q^i covers entities
with score in the interval [s, s + 0.01) when they appear
on position i (here q = ⌊100s⌋ + 1). In the example from
Table 3 we have n = 300 buckets and for the specific iteration
the three buckets to perform exploration from are P_61^3,
P_46^4, and P_41^5.

Probability estimate: P(click | s, i, k). In this implementation
Thompson sampling estimates the probability of click
given a ranking score s and original position i for a result
when it is shown at position k instead. This differs from the
previous case in that its estimate tries to take into account
the position bias, if any, associated with clicks.
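
A combined bucket can be represented as a (position, score bucket) pair. The sketch below illustrates the bucket count and the three active buckets for the k = 3 example; the function names are hypothetical and scores are assumed to have at most four decimal places.

```python
def score_bucket(score):
    """1-based index q = floor(100 * s) + 1, with the score 1.0 mapped to q = 100."""
    return min(int(round(score * 10000)) // 100, 99) + 1


def combined_bucket(score, position):
    """Bucket key for a result with L2 score `score` logged at `position`."""
    return (position, score_bucket(score))


def n_combined_buckets(n, k):
    """Total number of buckets: one hundred per candidate position k..N."""
    return (n - k + 1) * 100


# Candidates for exploration in the k = 3 example: (score, logged position).
candidates = [(0.60, 3), (0.45, 4), (0.40, 5)]
active = [combined_bucket(s, i) for s, i in candidates]
```

With N = 5 and k = 3 there are 300 buckets in total, and `active` holds the keys (3, 61), (4, 46) and (5, 41).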

These are definitely not all the buckets that could be defined.
Depending on the concrete system there may be others that
are more suitable and lead to even better results. The point
we are trying to convey is that how one defines buckets can
vary and that it plays a crucial role in the success of the
exploration process.

3.2 Example weighting 

Once we have an EE procedure in place, a natural question
to ask is how to best use the exploration data to train
improved models. Here we introduce two schemes for assigning
training weights to examples collected through exploration.


Propensity based weights.

In training new rankers it is a common practice to re-weight
examples selected through exploration inversely
proportionally to the probability of displaying them to the user.
The probability of selecting an example for exploration is
called the propensity score [12]. More specifically, if we denote
the propensity score for example x_j with p(x_j), then its
training weight is set to $w_j = 1 / p(x_j)$.

Computing propensity scores for Thompson sampling in
the general case of more than two Beta distributions involved
in each iteration (l > 2, line 5 in Algorithm 1) does not have
an analytical solution. We thus derive the following
empirical estimate: if we draw for exploration x_j from bucket
P_i (line 6 of the algorithm) then we set p(x_j) to the ratio
between the number of examples that have been explored thus
far from P_i and the total number of examples explored from
the buckets in ℙ.
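
The empirical propensity estimate can be sketched as follows. The function names and the dictionary of per-bucket exploration counts are assumptions for illustration. The sketch also assumes that the counts already include the current draw, so the chosen bucket has a non-zero count; the uniform fallback for empty counts is likewise an assumption, not something stated in the text.

```python
def empirical_propensity(explore_counts, chosen, active):
    """Share of past explorations that came from `chosen` among `active` buckets."""
    total = sum(explore_counts.get(c, 0) for c in active)
    if total == 0:
        return 1.0 / len(active)  # assumption: uniform fallback before any data
    return explore_counts.get(chosen, 0) / total


def propensity_weight(p):
    """Training weight w_j = 1 / p(x_j)."""
    if p <= 0.0:
        raise ValueError("propensity must be positive")
    return 1.0 / p


# Hypothetical counts of explored examples per score bucket.
counts = {61: 6, 46: 3, 41: 1}
p = empirical_propensity(counts, chosen=46, active=[61, 46, 41])
w = propensity_weight(p)
```

In this made-up example bucket 46 accounts for 3 of the 10 explorations, so the propensity is 0.3 and the weight is 1/0.3.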


Multinomial weights.

We also analyze a different, very simple weighting scheme
based on the scores of the baseline ranking model. Let x_i be
the result displayed to the user from bucket P_i and let its
ranking score be s_i. If x_j is the selected example for
exploration then we first compute the “multinomial probability”

$$ p(x_j) = \frac{s_j}{\sum_{i \in \{c_1, \ldots, c_l\}} s_i} . $$

The weight is then computed again as the inverse,
$w_j = 1 / p(x_j)$. If in Table 3 we have selected for exploration
the example at position i = 3, then its probability is
0.60 / (0.60 + 0.45 + 0.40) ≈ 0.414 and its weight is
approximately 2.41. We call this weighting scheme multinomial
weighting.

In both weighting schemes we cap the weight assigned to
an example to avoid stressing excessively a single piece of
evidence, as suggested in previous work [2, 12, 17].
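
A small sketch of multinomial weighting with capping, applied to the Table 3 example. The function names and the cap value of 10 are hypothetical; the text states that weights are capped but does not give the threshold.

```python
def multinomial_probability(scores, j):
    """p(x_j) = s_j divided by the sum of the scores of all active candidates."""
    return scores[j] / sum(scores)


def multinomial_weight(scores, j, cap=10.0):
    """Inverse multinomial probability, capped at `cap` (cap value is an assumption)."""
    return min(1.0 / multinomial_probability(scores, j), cap)


# Exploration candidates from Table 3: positions i = 3, 4, 5.
scores = [0.60, 0.45, 0.40]
p = multinomial_probability(scores, 0)  # 0.60 / 1.45
w = multinomial_weight(scores, 0)       # 1.45 / 0.60
```

Lower-scored candidates receive larger weights: selecting the candidate with score 0.40 gives a weight of 1.45 / 0.40, a little over 3.6.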


4. POLICY EVALUATION AND COMPARISON

The proposed offline framework allows us to evaluate different
EE policies, such as the Thompson sampling policies
described in the previous section. We mainly focus on
two aspects: 1) Expected ranking model improvement; 2)
Change in CTR during exploration. We start with our
experimental setup.

4.1 Evaluation setup 

Data and Baseline Model: We use maps autocomplete
(AC) logs spanning two months, collected by the
production model. The first month is used to simulate online
exploration, as described in Section 2.3, to compute the
expected CTR loss, and subsequently to train new ranking
models. The second month is used to test and report
performance results for the models trained on data from the first
month.

Figure 4: CTR lift during exploration for k = 2 - 100K and
1M AC Queries.

Figure 5: CTR lift during exploration for k = 3 - 100K and
1M AC Queries.

Our baseline, the No Exploration method (Figures 2 and 3),
closely replicates the L2 ranking model used in the production
system, a variant of boosted regression trees. The
model adopts a point-wise learning-to-rank approach, that
is, it is trained on individual suggestions and at scoring time
assigns a clickability score to each result, optimizing squared
loss. Point-wise models have been shown to be very competitive
on a variety of ranking problems. It is important to
note, though, that the ideas presented here are agnostic to
the exact L2 ranker deployed in production. For instance,
we obtained directionally similar results with more elaborate
pairwise rankers optimizing NDCG [20].


Evaluation process: We measure the performance for
datasets of sizes 10K, 50K, 100K and 1M queries which we
sample uniformly at random from the training month. Keep
in mind that each query leads to up to five suggestions,
i.e. for the largest set of 1M queries we train on over 2M
individual examples. Each example is represented as a high
dimensional vector of features capturing information about
the result and the user context: static rank of the result,
similarity to the query, distance to the user, etc.

For each of the query sets we run the following steps ten
times, summarizing in the figures the mean and the variance
across the ten runs.

Step 1. We choose a setting for the results to be displayed
by the offline system: k = 2 (Figures 2 and 4) and k = 3
(Figures 3 and 5).

Step 2. For the No Exploration baseline we take the top
k suggestions and train the baseline model on them. If the
production model is trained on only the top k results from
its logs, it will achieve exactly the same performance as the
No Exploration model from the plots.

Step 3. On the same set of queries used to train the
baseline model, we run the discussed Thompson sampling
policies, with n buckets as described in Section 3.1, dividing
the score interval into 100 equal subintervals as mentioned
in the same section. The policies involve a stochastic step:
each of them decides randomly to place on position i = k
one of the suggestions from positions i = k, k + 1, ..., N, as
described in Section 2.3. The resulting top k suggestions are
what a scaled-down version of the production system would
log if it utilized the respective policy. We use this data to
train the same type of model, configured identically to the
baseline model. Therefore, the improvement observed on
the test period can be attributed entirely to the exploration
data collected through the policies.

Step 4. If a query has fewer than, or equal to, k results in
the production logs we add it to the training set as is, because
in this case there are no results to explore from. We estimate
that approximately 60% of the queries have more than two
results and approximately 50% have more than three results.
These are the queries which contribute to exploration for the
two settings of k = 2 and k = 3 respectively.

Step 5. Finally, in training the models, we re-weight the
examples collected with the exploration policies using the
weighting schemes discussed in Section 3.2.
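
Steps 3 to 5 can be sketched for a single query as follows. This is an illustrative outline only: the function names, the `(label, score)` row layout, the `choose` callback standing in for a Thompson sampling policy, the unit weight for non-explored rows and the cap value are all assumptions of the sketch.

```python
def multinomial_weight(scores, j, cap=10.0):
    """Capped inverse of s_j / sum(s); the cap value is an assumption."""
    return min(sum(scores) / scores[j], cap)


def simulate_query(rows, k, choose):
    """Build the weighted training rows one query contributes offline.

    rows   : list of (label, score) in production order, position 1 first
    choose : callable taking the candidate rows (positions k..N) and
             returning the index of the one to show at position k
    Returns a list of (label, score, weight).
    """
    if len(rows) <= k:
        # Step 4: nothing to explore from, add the query as is.
        return [(label, score, 1.0) for label, score in rows]
    fixed, candidates = rows[: k - 1], rows[k - 1 :]
    j = choose(candidates)  # Step 3: the policy's stochastic choice
    scores = [score for _, score in candidates]
    label, score = candidates[j]
    logged = [(l, s, 1.0) for l, s in fixed]
    # Step 5: re-weight the explored example.
    logged.append((label, score, multinomial_weight(scores, j)))
    return logged


# Rows of Table 1; a stand-in policy that always keeps the incumbent.
table_1 = [(0, 0.95), (0, 0.90), (1, 0.60), (0, 0.45), (0, 0.40)]
training_rows = simulate_query(table_1, k=3, choose=lambda cands: 0)
```

Swapping the `choose` callback for a sampler over positions, scores, or both reproduces the three policies under the same logging logic.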

4.2 Evaluation of policies 

Let us now discuss in some more detail the results from 
the different experiments. 

Model improvement. Figures 2 and 3 show the model
improvement achieved with different policies for k = 2 and
k = 3 suggestions respectively. CTR is measured as the
fraction of queries that would result in a click over all queries
from the test month in an Autocomplete system that would
display up to k suggestions.
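
The CTR measure defined here can be sketched as below, assuming each test query is represented by the list of click labels of its results in the order in which the evaluated model ranks them (the function name and the data layout are assumptions).

```python
def ctr_at_k(queries, k):
    """Fraction of queries with at least one click among the top k results."""
    if not queries:
        raise ValueError("need at least one query")
    clicked = sum(1 for labels in queries if any(labels[:k]))
    return clicked / len(queries)


# Three made-up test queries; labels are listed in ranked order.
test_queries = [
    [0, 0, 1, 0, 0],  # click only visible when k >= 3
    [1, 0],           # click at the top position
    [0, 0, 0],        # no click at all
]
ctr_2 = ctr_at_k(test_queries, k=2)
ctr_3 = ctr_at_k(test_queries, k=3)
```

On this toy data the CTR is 1/3 for k = 2 and 2/3 for k = 3, which illustrates why displaying more results mechanically raises the measure.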

Naturally, having more data to train on improves the
performance of all models. The CTR for all models for k = 2
(Figure 2) ranges from 0.87 to 0.92 while for k = 3 it is
between 0.94 and 0.96 (Figure 3). The larger CTR for k = 3
is due to two factors: 1) For k = 3 the system displays more
results, which are more likely to contain the clicked suggestion
from the production logs; 2) There is a difference in the
number of training examples too. For instance, for k = 2
we still use 1M queries in the largest experiment, but each
query contributes up to two results, i.e. we have
approximately 1.5M training examples in total. For k = 3
there are approximately 2.5M examples for the same set of
queries.

We can see from the figures that in both cases the policies
which account for bias in the model scores, Thompson
over Scores (Scores for simplicity) and Thompson over
Scores&Positions (Scores&Positions for simplicity), achieve
a large improvement over the production No Exploration
system. The advantage of the Scores policy is especially striking
for k = 2, yielding a consistent 1.5% model improvement
across all query set sizes. In fact, using Scores or
Scores&Positions with just 50K training queries results in
models which have better test performance for both settings
of k than the production No Exploration model trained on
1M queries!

What is the reason for Scores to perform better than
Scores&Positions, and for both of them to outperform
Positions, which barely improves on the production model for
1M queries? We believe the answer is twofold.

First, position bias is not very pronounced in this
problem. In web search often multiple results
are relevant to a query. Showing them on different positions
frequently yields different CTR. Here the intent is usually
very well specified. Users are mostly interested in one
particular geo-entity, e.g. a specific restaurant, park or a
residential address. There are also only up to five suggestions
per query, which are easy to inspect visually.

Second, there is a very stable CTR associated with each
position. Higher positions have higher CTR. Thompson sampling
policies that utilize position information quickly identify
this and start apportioning most of the exploration to
the optimal position. Therefore, exploration stops too early.

Exploration rate e. For all experiments we use exploration 
rate e = 1 (lines 9 and 11 of the algorithm). Decreasing e 
forces Thompson sampling to continue exploration longer. 
As we mentioned above, Positions stops exploration too 
early. To force further exploration, for this policy only, we 
decrease the rate to e = 0.01. Still, as the figures show, 
Positions only marginally improves on No Exploration 
for 1M examples. It should be noted that one cannot 
decrease e dramatically, because then there is a significant drop 
in CTR during exploration, which might be too steep a 
price to pay for model improvement. 
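
To illustrate how the exploration rate can control how long exploration lasts, the following is a minimal sketch of Beta-Bernoulli Thompson sampling. It assumes that e scales the posterior updates (a click adds e to alpha, a non-click adds e to beta). This is consistent with the behavior described above, but it is an assumption, since the algorithm itself is not reproduced in this section. The class name, the simulated CTR values and the helper `share_of_best_arm` are hypothetical.

```python
import random


class BetaThompson:
    """Beta-Bernoulli Thompson sampling over a fixed set of buckets (arms)."""

    def __init__(self, n_arms, e=1.0, seed=0):
        self.e = e                    # assumed role: size of each posterior update
        self.alpha = [1.0] * n_arms   # click pseudo-counts (uniform prior)
        self.beta = [1.0] * n_arms    # non-click pseudo-counts
        self.rng = random.Random(seed)

    def select(self):
        # Draw one CTR sample per arm from its posterior and pick the largest.
        draws = [self.rng.betavariate(a, b)
                 for a, b in zip(self.alpha, self.beta)]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, arm, clicked):
        # Smaller e: posteriors sharpen more slowly, so exploration lasts longer.
        if clicked:
            self.alpha[arm] += self.e
        else:
            self.beta[arm] += self.e


def share_of_best_arm(e, ctrs=(0.10, 0.30, 0.20), rounds=2000, seed=0):
    """Simulate clicks with made-up CTRs; return how often the best arm is picked."""
    sampler = BetaThompson(len(ctrs), e=e, seed=seed)
    env = random.Random(seed + 1)
    best = max(range(len(ctrs)), key=ctrs.__getitem__)
    hits = 0
    for _ in range(rounds):
        arm = sampler.select()
        sampler.update(arm, env.random() < ctrs[arm])
        hits += arm == best
    return hits / rounds
```

Calling `share_of_best_arm(1.0)` and `share_of_best_arm(0.01)` side by side gives a rough feel for how much traffic each setting still sends to suboptimal arms in this toy simulation.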

Weighting scheme. We repeated all model improvement 
experiments with both weighting schemes discussed in 
Section 3.2. Though propensity based weighting is considered 
a more suitable weighting scheme in the literature, 
for our multi-result ranking problem we observed a different 
outcome. For k = 2 the multinomial weighting produced 
over 0.5% improvement compared to propensity weights. 
Multinomial weighting was also better in the case of k = 3, 
though the difference was less pronounced. All model 
improvement results presented here reflect multinomial weighting. 

CTR lift in exploration. So far we saw that we can 
train models that outperform the production one by using 
exploration data from specific policies. We now look into 
what price the production system will pay for this model 
improvement. The common understanding is that, as EE selects 
potentially suboptimal results, production systems should 
always sustain a drop in CTR during the period of exploration. 

The figures on CTR lift demonstrate one of the major 
contributions of this work. They show that the above assumption is 
not necessarily true. On the contrary, if the right policy is 
selected, multi-result ranking systems can even record a lift 
in their CTR. Here, CTR lift during exploration is defined 
as the CTR of each policy minus the CTR of the production 
No Exploration system, both computed during the training 
month. 
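
The definition above amounts to a difference of two rates measured over the same period. A minimal sketch follows; the function names and the click and impression counts in the usage note are hypothetical and are not taken from the experiments.

```python
def ctr(clicks, impressions):
    """Click-through rate: clicks divided by impressions."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return clicks / impressions


def ctr_lift(policy_clicks, policy_impressions, base_clicks, base_impressions):
    """CTR of the exploration policy minus CTR of the No Exploration baseline.

    Both CTRs are assumed to be measured over the same period.
    A positive value is a lift, a negative value is a drop.
    """
    return (ctr(policy_clicks, policy_impressions)
            - ctr(base_clicks, base_impressions))
```

For example, `ctr_lift(520, 1000, 500, 1000)` is a lift of about 0.02 (two percentage points), while swapping the two systems gives a drop of the same size.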

The Positions policy indeed impacts the CTR of the 
production system negatively: 1.2% and 0.3% drop in CTR for 
k = 2 with 100K and 1M queries, and 0.6% and 0.2% 
drop in CTR for k = 3. The results are for e = 0.01, 
yet even for e = 1 we observe a drop in CTR. 

With the Scores and Scores&Positions policies, however, 
we observe stable lift in CTR, especially as the query set size 
grows to 1M. For k = 2 and a dataset size of 1M queries the 
policies improve the production CTR by 0.5% and 0.8% 
respectively. For k = 3 the lift is approximately 4.2% and 4.3%. 
The reason that for k = 3 the lift is lower is 
that in this setting there is a smaller pool of 
candidates for the system to explore: it runs exploration 
among only three positions k = 3, k = 4 and k = 5. 

Finally, the position selection figures show from which positions 
the selection was performed during exploration in the case 
of k = 2 and k = 3 respectively. As can be seen, Positions is 
very conservative and explores mostly the optimal position 
(i = 2 for k = 2 and i = 3 for k = 3) when the number of 
examples increases to 1M, even though we have set e = 0.01. 
Another interesting observation is that Scores&Positions is 
more conservative than Scores, selecting fewer examples from 
suboptimal positions. As we saw above, this leads to a greater 
CTR lift during exploration than Scores. However, it also 
produces less of an improvement in the ranking model. 


5. RELATED WORK 

Explore-exploit techniques hold the promise of improving 
machine-learned models by collecting high-quality, randomized 
data. A common concern in production teams, however, 
is that they may invest resources integrating EE in their 
systems, suffer high cost during the EE period, and end up with 
data that does not lead to substantially better models. To 
address this concern multiple efforts have focused on building 
offline systems that try to quantify a priori the effects 
that EE will have on the system [2, 12]. All of 
these works focus on how data collected with a production 
model or another policy [10] can be used to estimate a priori 
the performance of a new policy. Under the assumption of 
a stationary data distribution, it can be proved that weighting 
data inversely proportional to the propensity scores leads to 
unbiased offline estimators, i.e. models which we can 
guarantee will behave in a certain way in production. 
Many recent works adopt this evaluation approach. 
While theoretically sound, these offline 
frameworks make a few assumptions which may not always 
hold. For instance, as we noted earlier, in certain cases 
the propensity scores cannot be computed in closed form, 
e.g. in Thompson sampling, so one needs to use approximations 
such as the one implemented here. Another problem lies 
in the fact that it is often impractical to assume a stationary 
distribution; for instance, we find in our data many 
seasonal queries, queries which result from hot geo-political 
news etc., all of which impact the CTR of the system 
significantly. 
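
The inverse propensity weighting idea described above can be sketched as follows. This is a generic illustration of an inverse propensity estimator of CTR, not the estimator used in our framework; the log format `(context, action, click, propensity)`, the function name and the target policy callable are assumptions made for the example.

```python
def ips_ctr_estimate(logs, target_policy):
    """Estimate the CTR a target policy would obtain, from logged data.

    logs: iterable of (context, action, click, propensity), where
          propensity is the probability with which the logging policy
          chose `action` for `context` (assumed known and > 0).
    target_policy: callable mapping a context to the action it would take.
    """
    total = 0.0
    n = 0
    for context, action, click, propensity in logs:
        if propensity <= 0:
            raise ValueError("propensity must be positive")
        n += 1
        # Only records where the target policy agrees with the logged
        # action contribute; each is reweighted by 1 / propensity.
        if target_policy(context) == action:
            total += click / propensity
    if n == 0:
        raise ValueError("no logged records")
    return total / n
```

With logs collected uniformly at random over two actions (propensity 0.5 each), a target policy that always picks action 0 is credited only with the clicks logged for action 0, each counted twice to compensate for the records it could not use.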

Our work can be considered a special case of the generic 
Contextual Bandit framework. Unlike these 
works where the context is assumed to come in the form of 
additional observations or features, e.g. personalized 
information [13], the context in our case is in the rich structure 
of the problem. For example, most of the works mentioned 
in this section focus on the single result case, i.e. there 
are k arms to choose from but ultimately only one result 
is displayed to the user. We focus on multi-result ranking 
systems instead. As we saw here, many real-world problems 
follow the multi-result ranking setting, which requires special 
handling due to the presence of position and ranking score 
bias. More recently, the multi-result setting has also been 
discussed. The authors describe a non-stochastic procedure for 
optimizing a loss function which is believed to lead to proper 
exploration. While the work is theoretically sound, it does 
not show whether the approach leads to improvement of the 
underlying model. The method also lacks some of the observed 
convergence properties of Thompson sampling, which 
over time starts apportioning examples to buckets which it 
finds likely to lead to higher CTR. 

The effectiveness of Thompson sampling has been noted 
in previous work. Subsequently, efforts have 
focused on better understanding the theoretical properties 
of the algorithm, leaving aside the important implementation 
considerations which we raised, namely, that in 
the context of multi-result ranking there are multiple ways to 
define the buckets (arms), and that different definitions lead 
to different semantic interpretations and different results. 
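
As a rough illustration of what "multiple ways to define the buckets" means for the three policies compared in this paper, the sketch below maps a candidate result to a bucket key under each definition. The equal-width binning of scores, the assumption that scores lie in [0, 1], the number of bins and all names are hypothetical choices made for the example.

```python
def bucket_key(policy, position, score, n_score_bins=10):
    """Return the Thompson sampling bucket (arm) for one candidate result.

    policy: "positions", "scores" or "scores_positions".
    position: rank of the candidate in the production ordering.
    score: model score, assumed here to lie in [0, 1].
    """
    # Assumed discretization: equal-width bins, with score 1.0 in the last bin.
    score_bin = min(int(score * n_score_bins), n_score_bins - 1)
    if policy == "positions":
        return (position,)
    if policy == "scores":
        return (score_bin,)
    if policy == "scores_positions":
        return (score_bin, position)
    raise ValueError("unknown policy: " + policy)
```

The same candidate, say at position 2 with score 0.37, lands in bucket `(2,)`, `(3,)` or `(3, 2)` depending on the policy, so the posterior it updates, and hence what the sampler learns, differs across the three definitions.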

6. CONCLUSION 

We presented an offline framework which allows evaluation 
of EE policies prior to their deployment in an online 
environment. The framework allowed us to define and compare 
several different policies based on Thompson sampling. We 
demonstrated an interesting effect with significant practical 
implications. Contrary to the common belief that a production 
system often has to pay a price of (possibly steep) 
CTR decrease during exploration, we show that the opposite 
can happen. If implemented suitably, a Thompson sampling 
policy can, in fact, have twofold benefits: it can collect data 
that improves the baseline model performance significantly, 
and at the same time it can lift the CTR of the production 
system during the period of exploration. 